Create tests/distributed/test_mnnvl_alltoall.py #35241

puririshi98 wants to merge 47 commits into vllm-project:main from
Conversation
All 5 tests pass on 8xH100 with the latest NVIDIA stack.

Signed-off-by: Rishi Puri <riship@nvidia.com>
Code Review
The pull request introduces a new test file for MNNVL AllToAll operations, ensuring the correct functionality and initialization of FlashInfer components within a distributed environment. The tests cover manager initialization, workspace reinitialization, and the ensure_initialized method, as well as a custom communicator wrapper. The setup correctly handles multi-GPU environments and checks for necessary system capabilities like SYS_PTRACE.
One area for improvement is the broad exception handling in has_sys_ptrace_capability, which could mask underlying issues.
```python
except Exception:
    pass
```

Catching a generic `Exception` can hide specific issues that might arise while reading or parsing `/proc/self/status`. It's generally better to catch more specific exceptions (e.g., `IOError`, `ValueError`) to avoid masking other potential bugs. While this function is only a capability check, more precise exception handling would improve maintainability and debugging.

Suggested change:

```diff
-    except Exception:
-        pass
+    except (IOError, ValueError) as e:
+        # Log the error for debugging purposes, but continue with alternative checks
+        print(f"Warning: Error reading /proc/self/status: {e}")
```
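To see the suggestion in context, here is a minimal sketch of such a capability check. This is an illustrative reconstruction, not the actual helper from the test file: the assumption is that `has_sys_ptrace_capability` parses the `CapEff` bitmask from `/proc/self/status` (CAP_SYS_PTRACE is bit 19 on Linux), and the real implementation may differ.

```python
# Hypothetical reconstruction of the capability check discussed above.
# Assumption: the check reads the effective-capability bitmask from
# /proc/self/status; CAP_SYS_PTRACE is bit 19 (see capabilities(7)).
CAP_SYS_PTRACE = 19


def has_sys_ptrace_capability() -> bool:
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("CapEff:"):
                    # CapEff is a hex bitmask, e.g. "CapEff: 000001ffffffffff"
                    cap_eff = int(line.split()[1], 16)
                    return bool(cap_eff & (1 << CAP_SYS_PTRACE))
    except (IOError, ValueError) as e:
        # Log for debugging, but treat failures as "capability unavailable"
        # instead of silently swallowing every exception.
        print(f"Warning: Error reading /proc/self/status: {e}")
    return False
```

With the narrower `except` clause, a malformed `CapEff` line or an unreadable procfs is reported rather than silently ignored, while unrelated bugs (e.g. a `NameError`) still surface normally.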
Hi @puririshi98, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: Rishi Puri <riship@nvidia.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
recent changes: on dgxh100
Signed-off-by: Rishi Puri <riship@nvidia.com>
Yes, this test refers to the MNNVL all2allv implementation from PR #21003. The test file validates FlashInferAllToAllManager which:
Signed-off-by: Rishi Puri <riship@nvidia.com>
Replace deprecated torch.cuda API calls with torch.accelerator equivalents:

- torch.cuda.set_device() → torch.accelerator.set_device_index()
- torch.cuda.current_device() → torch.accelerator.current_device_index()
- torch.cuda.device_count() → torch.accelerator.device_count()

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Claude <claude@anthropic.com>
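The three renames in the commit message can be captured in a small lookup table, which is handy when sweeping a codebase for the deprecated spellings. This is only a sketch: `migrate_call` is a hypothetical helper written for illustration, not part of the PR, and the mapping itself is taken verbatim from the commit message.

```python
# Mapping of deprecated torch.cuda calls to their torch.accelerator
# equivalents, as listed in the commit message above.
CUDA_TO_ACCELERATOR = {
    "torch.cuda.set_device": "torch.accelerator.set_device_index",
    "torch.cuda.current_device": "torch.accelerator.current_device_index",
    "torch.cuda.device_count": "torch.accelerator.device_count",
}


def migrate_call(call: str) -> str:
    """Return the torch.accelerator spelling for a deprecated torch.cuda
    call, or the call unchanged if it has no replacement in the table."""
    return CUDA_TO_ACCELERATOR.get(call, call)
```

Note that only `set_device` and `current_device` change their suffix (gaining `_index`); `device_count` keeps its name under the new namespace.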
How hard would it be to add … Might be better to make the names consistent in this test for clarity?
WIP |
Signed-off-by: Rishi Puri <riship@nvidia.com>
hjjq left a comment:
Hi @puririshi98 I've left some questions. PTAL
Signed-off-by: Rishi Puri <riship@nvidia.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
All 5 tests pass on 8xH100 with the latest NVIDIA stack. This is part of the NVIDIA effort to add CI to upstream GitHub.
GPU Hours Estimate: